SUMMARY This paper presents an architecture and a synthesis method
for compact numerical function generators (NFGs) for trigonometric,
logarithmic, square root, reciprocal, and combinations of these
functions. Our NFG partitions a given domain of the function into
non-uniform segments using an LUT cascade, and approximates the given
function by a quadratic polynomial for each segment. Thus, we can
implement fast and compact NFGs for a wide range of functions.
Experimental results show that: 1) our NFGs require, on average, only
4% of the memory needed by NFGs based on the linear approximation with
non-uniform segmentation; 2) our NFG for $2^x - 1$ requires only 22%
of the memory needed by the NFG based on a 5th-order approximation
with uniform segmentation; and 3) our NFGs achieve about 70% of the
throughput of the existing table-based NFGs using only a few percent
of the memory. Thus, our NFGs can be implemented with more compact
FPGAs than needed for the existing NFGs. Our automatic synthesis
system generates such compact NFGs quickly.
key words: LUT cascades, 2nd-order Chebyshev approximation,
non-uniform segmentation, NFGs, automatic synthesis, FPGA

1.  Introduction 

Numerical functions f(x), such as trigonometric, logarithmic, square
root, reciprocal, and combinations of these functions, are extensively
used in computer graphics, digital signal processing, communication
systems, robotics, astrophysics, fluid physics, etc. To compute
elementary functions, iterative algorithms, such as the CORDIC
(Coordinate Rotation Digital Computer) algorithm [1], [23], have often
been used. Although the CORDIC algorithm achieves accuracy with
compact hardware, its computation time is proportional to the
precision (i.e., the number of bits). For a function composed of
elementary functions, the CORDIC algorithm is slower, since it
computes each elementary function sequentially. It is too slow for
numerically intensive applications. Implementation by a single lookup
table for f(x) is simple and very fast. For low-precision computations
of f(x) (e.g., x and f(x) have 8 bits), this implementation is
straightforward. For high-precision computations, however, the single
lookup table implementation is impractical due to the huge table size.

Fig. 1 Synthesis flow for NFGs.
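The table-size argument can be made concrete with a short worked calculation. It assumes a direct table that stores one m-bit output word for each of the 2^n values of an n-bit input; the symbols n and m are introduced here only for illustration.

```latex
\[
  \text{table size} = 2^{n} \times m \ \text{bits}
\]
\[
  n = m = 8:\quad 2^{8} \times 8 = 2{,}048 \ \text{bits},
  \qquad
  n = m = 24:\quad 2^{24} \times 24 = 402{,}653{,}184 \ \text{bits} \approx 48\ \text{MiB}.
\]
```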

To reduce the memory size, polynomial approximations have been used
[9], [10], [21], [22]. These methods approximate the given numerical
functions by piecewise polynomials, and realize the polynomials with
hardware. Linear approximations offer fast and middle-precision
evaluation of numerical functions. However, for high precision, these
methods also require excessive memory size. Moreover, the methods
proposed so far are ad hoc and not systematic. This paper proposes an
architecture and a systematic synthesis method for NFGs based on
quadratic approximation. By using the LUT cascade [8], many numerical
functions are efficiently approximated by piecewise quadratic
functions, and are realized by NFGs with small memory size. Our
synthesis method is fully automated, so that fast and compact NFGs can
be produced by non-experts. Figure 1 shows the synthesis flow for the
NFG. It converts the Design Specification, written in Scilab [19] (a
MATLAB-like software package), into HDL code. The Design Specification
consists of a function f(x), a domain over x, and an accuracy. This
system first partitions the domain into segments, and then
approximates f(x) by a quadratic function in each segment. Next, it
analyzes the errors, and derives the necessary precision for computing
units in the NFG. Then, it generates HDL code to be mapped into an
FPGA using an FPGA vendor tool.

2.  Preliminaries 

Definition 1: The binary fixed-point representation of a value r has
the form

\[ d_{n\_int-1}\, d_{n\_int-2} \cdots d_1\, d_0 \,.\, d_{-1}\, d_{-2} \cdots d_{-n\_frac} \tag{1} \]

where $d_i \in \{0, 1\}$, n_int is the number of bits for the integer
part, and n_frac is the number of bits for the fractional part of r.
The representation in (1) is two's complement, and so

\[ r = -2^{n\_int-1}\, d_{n\_int-1} + \sum_{i=-n\_frac}^{n\_int-2} 2^{i}\, d_i. \tag{2} \]

Definition 2: Error is the absolute difference between the original
value and the approximated value. Approximation error is the error
caused by a function approximation. Rounding error is the error caused
by a binary fixed-point representation. It is the result of truncation
or rounding, whichever is applied; in either case, the resulting error
is called rounding error. Acceptable error is the maximum error that
an NFG may assume. Acceptable approximation error is the maximum
approximation error that a function approximation may assume.

For example, when a function is approximated by a piecewise linear
function, approximation error occurs because the approximating
function does not equal the approximated function everywhere.
Typically, there is a non-zero error over most of the function values;
indeed, this error is usually 0 only for a very few function values.
Rounding error occurs because the binary representation (2) of the
function value takes on only specific values. For example, if the
function value is $\sqrt{2}$, there is no representation in the form
(2), where n_frac is finite, that is exactly $\sqrt{2}$.
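Equation (2) can be checked mechanically. The following Python sketch evaluates a two's-complement bit pattern exactly with rational arithmetic. The function name `fixed_point_value` and the MSB-first bit-string input are illustrative choices, not part of the paper, and the sketch assumes n_int is at least 1 so that a sign bit exists.

```python
from fractions import Fraction


def fixed_point_value(bits: str, n_int: int, n_frac: int) -> Fraction:
    """Evaluate (2) for a two's-complement bit string given MSB first."""
    if len(bits) != n_int + n_frac or set(bits) - {"0", "1"}:
        raise ValueError("bits must contain exactly n_int + n_frac binary digits")
    value = Fraction(0)
    for k, ch in enumerate(bits):
        i = n_int - 1 - k               # position k holds digit d_i
        weight = Fraction(2) ** i       # exact power of two, also for i < 0
        if k == 0:
            weight = -weight            # the leading digit carries weight -2^(n_int-1)
        value += weight * int(ch)
    return value


# 01.10 -> 1 + 1/2, and 11.10 -> -2 + 1 + 1/2
print(fixed_point_value("0110", 2, 2), fixed_point_value("1110", 2, 2))
```

Because every representable value is a rational number with a power-of-two denominator, an irrational value such as $\sqrt{2}$ can only be approximated, which is the source of the rounding error described in Definition 2.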

Definition 3: Precision is the total number of bits for a binary
fixed-point representation. Specifically, n-bit precision specifies
that n bits are used to represent the number; that is,
n = n_int + n_frac. An n-bit precision NFG has an n-bit input.

Definition 4: Accuracy is the number of bits in the fractional part of
a binary fixed-point representation. Specifically, m-bit accuracy
specifies that m bits are used to represent the fractional part of the
number; that is, m = n_frac. An m-bit accuracy NFG is an NFG with an
m-bit fractional part of the input, an m-bit fractional part of the
output, and a $2^{-m}$ acceptable error.

Definition 5: Truncation is the process of removing lower order bits
from a binary fixed-point number. If r is represented as
$r = d_{n\_int-1}\, d_{n\_int-2} \cdots d_0 \,.\, d_{-1} \cdots d_{-n\_frac}$
and is truncated to
$r' = d_{n\_int-1}\, d_{n\_int-2} \cdots d_0 \,.\, d_{-1} \cdots d_{-i}$,
then the resulting rounding error is at most $2^{-i} - 2^{-n\_frac}$,
where $i < n\_frac$. Truncation of r never produces a value larger
than r.

Definition 6: Rounding is truncation if $d_{-(i+1)} = 0$. If
$d_{-(i+1)} = 1$, $2^{-i}$ is added to the result of truncation. That
is, if r is represented as

\[ r = d_{n\_int-1}\, d_{n\_int-2} \cdots d_0 \,.\, d_{-1} \cdots d_{-n\_frac}, \]

r is rounded to

\[ r' = d_{n\_int-1}\, d_{n\_int-2} \cdots d_0 \,.\, d_{-1} \cdots d_{-i} + d_{-(i+1)}\, 2^{-i}. \]

The error caused by rounding is at most $2^{-(i+1)}$.

Truncation can cause a larger error than rounding. On the other hand,
truncation requires less hardware. In our architecture, we use both
truncation and rounding. This is discussed in the Appendix.


3.  Piecewise  Quadratic  Approximation 

To approximate the numerical function f(x) using quadratic functions,
we first partition the domain for x into segments. For each segment,
we approximate f(x) using a quadratic function
$g(x) = c_2 x^2 + c_1 x + c_0$. In this case, we seek the fewest
segments, since this reduces the memory size needed for storing the
coefficients of the quadratic functions.

For piecewise polynomial approximations, in many cases, the domain is
partitioned into uniform segments [2]-[4], [21], [22]. Such methods
are simple and fast, but for some kinds of numerical functions, too
many segments are required, resulting in large memory.

For a given error, non-uniform segmentation of the domain uses fewer
segments than uniform segmentation [9], [18]. However, a non-uniform
segmentation requires an additional circuit that maps values of x to a
segment number. Potentially, this is a complex circuit. To simplify
the additional circuit, Lee et al. [9] have proposed a special
non-uniform segmentation for the $\sqrt{-\ln(x)}$ function. Their
method produces a simple circuit by restricting the segmentation
points, and results in fewer segments as well as a faster and more
compact NFG than produced by uniform segmentation. However, it is not
always optimum for the given function f(x). For a fast and compact
realization of any non-uniform segmentation, we use an LUT cascade
proposed by Sasao et al. [17], [18] (see Sect. 4). By using the LUT
cascade, we can use an optimum non-uniform segmentation for the given
function f(x). As far as we know, there is no other method that uses
an optimum non-uniform segmentation.

3.1  Non-uniform  Segmentation  Algorithm 

The number of non-uniform segments depends on the approximation
polynomial. A more accurate approximation polynomial requires fewer
segments. In this paper, we use the 2nd-order Chebyshev polynomials to
approximate f(x). We show that its coefficients are computed easily
and quickly, so it is suitable for fast automatic synthesis.

For a segment [s, e] of f(x), the maximum approximation error
$\varepsilon_2(s, e)$ of the 2nd-order Chebyshev approximation [11] is
given by

\[ \varepsilon_2(s, e) = \frac{(e - s)^3}{192} \max_{s \le x \le e} |f^{(3)}(x)|, \tag{3} \]

where $f^{(3)}$ is the 3rd-order derivative of f. From (3),
$\varepsilon_2(s, e)$ is a monotone increasing function of segment
width e - s. Using this property, we partition a domain into segments
that are as wide as possible such that the approximation error is less
than the specified approximation error. Figure 2 shows the non-uniform
segmentation algorithm. The inputs for this algorithm are a numerical
function f(x), a domain [a, b] for x, and an acceptable approximation
error $\varepsilon_a$. Then, this algorithm approximates f(x) with the
acceptable approximation error $\varepsilon_a$, and produces t
segments $[s_0, e_0], [s_1, e_1], \ldots, [s_{t-1}, e_{t-1}]$. For
step 2 in Fig. 2, the accurate computation of the value p where
$\varepsilon_2(s_i, p) = \varepsilon_a$ is difficult. Thus, we obtain
the maximum value p' satisfying
$\varepsilon_2(s_i, p') \le \varepsilon_a$. Such p' can be found by
scanning values of the n-bit input x. However, this requires an
$O(2^n)$ search, and is time-consuming. Therefore, we compute the
maximum value p' by specifying the bits of x from the MSB to the LSB
such that $\varepsilon_2(s_i, p') \le \varepsilon_a$. This requires a
search with time complexity O(n). In the computation of
$\varepsilon_2(s_i, p')$, the value of
$\max_{s_i \le x \le p'} |f^{(3)}(x)|$ is computed by a nonlinear
programming algorithm [7].

Input:   Numerical function f(x), domain [a, b] for x,
         acceptable approximation error $\varepsilon_a$.
Output:  Segments $[s_0, e_0], [s_1, e_1], \ldots, [s_{t-1}, e_{t-1}]$.
Process:
  1. Let $s_0 = a$ and i = 0.
  2. Find a value p ($> s_i$) where $\varepsilon_2(s_i, p) = \varepsilon_a$.
  3. If p > b, then let p = b.
  4. Let $e_i = p$ and i = i + 1.
  5. If p = b, then let t = i, and stop the process.
  6. Else, let $s_i = p$, and go to step 2.

Fig. 2 Non-uniform segmentation algorithm for the domain.
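The segmentation procedure, including the MSB-to-LSB search for the segment end point, can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the domain is non-negative, the error bound is $(e - s)^3/192$ times the maximum of $|f^{(3)}|$ on the segment, and the caller supplies that maximum through the hypothetical callback `max_abs_f3` instead of running a nonlinear programming solver. The acceptable error $2^{-17}$ in the example is likewise an assumed value.

```python
def cheb2_error(s, e, max_abs_f3):
    """Assumed error bound: (e - s)^3 / 192 * max |f'''(x)| over [s, e]."""
    return (e - s) ** 3 * max_abs_f3(s, e) / 192.0


def segment_domain(a, b, eps_a, n_frac, max_abs_f3):
    """Greedy non-uniform segmentation of [a, b] on a grid of 2^-n_frac (a >= 0)."""
    scale = 1 << n_frac
    a_int, b_int = round(a * scale), round(b * scale)
    n_bits = b_int.bit_length()
    segments = []
    s_int = a_int
    while s_int < b_int:
        # Decide the bits of the end point from MSB to LSB: O(n) error evaluations.
        p_int = 0
        for bit in range(n_bits - 1, -1, -1):
            cand = p_int | (1 << bit)
            if cand > b_int:
                continue                      # never go past the end of the domain
            if cand <= s_int or cheb2_error(s_int / scale, cand / scale, max_abs_f3) <= eps_a:
                p_int = cand                  # keep the bit: still within eps_a
        if p_int <= s_int:
            raise ValueError("eps_a is too small for this input grid")
        segments.append((s_int / scale, p_int / scale))
        s_int = p_int
    return segments


# Example: f(x) = 1/x on [1, 2]; |f'''(x)| = 6 / x^4 is largest at the left end s.
segs = segment_domain(1.0, 2.0, 2.0 ** -17, 15, lambda s, e: 6.0 / s ** 4)
print(len(segs), segs[0], segs[-1])
```

The bitwise search works because the error bound grows monotonically with the segment width, so "end point is acceptable" is true up to some largest grid value and false beyond it.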

3.2  Computation  of  the  Approximate  Value 

For each $[s_i, e_i]$, f(x) is approximated by the corresponding
quadratic function $g_i(x)$. That is, the approximated value y of f(x)
is computed by $y = g_i(x) = c_{2i} x^2 + c_{1i} x + c_{0i}$, where
the coefficients $c_{2i}$, $c_{1i}$, and $c_{0i}$ are derived from the
2nd-order Chebyshev approximation polynomial [11]. Substituting
$x - q_i + q_i$ for x yields the transformation

\[ g_i(x) = c_{2i}(x - q_i)^2 + (c_{1i} + 2 c_{2i} q_i)(x - q_i) + c_{0i} + c_{1i} q_i + c_{2i} q_i^2. \]

Let $c'_{1i} = c_{1i} + 2 c_{2i} q_i$ and
$c'_{0i} = c_{0i} + c_{1i} q_i + c_{2i} q_i^2$. Then, we have

\[ g_i(x) = c_{2i}(x - q_i)^2 + c'_{1i}(x - q_i) + c'_{0i}. \tag{4} \]

This transformation reduces the multiplier size.
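The equivalence between the original quadratic and the shifted form (4) can be confirmed with exact arithmetic. The sketch below is illustrative only; the function names and the sample coefficients are hypothetical.

```python
from fractions import Fraction


def shift_coefficients(c2, c1, c0, q):
    """Return (c2, c1', c0') so that g(x) = c2*(x-q)^2 + c1'*(x-q) + c0'."""
    c1_shift = c1 + 2 * c2 * q
    c0_shift = c0 + c1 * q + c2 * q * q
    return c2, c1_shift, c0_shift


def eval_original(c2, c1, c0, x):
    return c2 * x * x + c1 * x + c0


def eval_shifted(c2, c1_shift, c0_shift, q, x):
    d = x - q                      # small when q is chosen inside the segment
    return c2 * d * d + c1_shift * d + c0_shift


c2, c1, c0 = Fraction(3, 8), Fraction(-5, 4), Fraction(7, 2)
q = Fraction(9, 16)
x = Fraction(5, 8)
shifted = shift_coefficients(c2, c1, c0, q)
print(eval_original(c2, c1, c0, x) == eval_shifted(*shifted, q, x))
```

The multipliers then operate on x - q_i, which has a much smaller magnitude than x when q_i lies inside the segment.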

4.  Architecture  for  NFGs 

Figure 3 shows the architecture that realizes (4). It has 7 units: the
segment index encoder (which produces the segment index i for
$[s_i, e_i]$ given the value x); the coefficients table (which stores
$-q_i$, $c_{2i}$, $c'_{1i}$, and $c'_{0i}$); an adder (which produces
$x + (-q_i)$); a squaring unit; two multipliers; and the output adder.

A segment index encoder converts x into a segment index i. It realizes
the segment index function
$\mathit{seg\_func}(x): B^n \to \{0, 1, \ldots, t-1\}$ shown in
Fig. 4(a), where x has n bits, $B = \{0, 1\}$, and t denotes the
number of segments. NFGs in which the most significant bits directly
drive the address inputs of a coefficient memory need no segment index
encoder. On the other hand, our NFGs based on non-uniform segmentation
require this additional circuit. Although NFGs based on non-uniform
segmentation have a smaller coefficients table than those based on
uniform segmentation, the size of the segment index encoder may have
to be considered to provide a fair comparison. In [9], to simplify the
segment index encoder, a special non-uniform segmentation is used for
$\sqrt{-\ln(x)}$. However, it does not correspond to an optimum
segmentation. To produce compact NFGs for a wide range of functions,
it is important to guarantee that the size of the segment index
encoder is reasonable for any non-uniform segmentation. Our synthesis
system uses the LUT cascade [8], [16], [17] shown in Fig. 4(b) to
realize any given (optimum) seg_func(x). In an LUT cascade, the
interconnecting lines between adjacent LUTs are called rails. The size
of an LUT cascade depends on the number of rails. Sasao et al. [17]
have shown that the size of an LUT cascade depends on the number of
segments. Specifically,

Theorem 1: Let seg_func(x) be a segment index function with t
segments. Then, there exists an LUT cascade for seg_func(x) with at
most $\lceil \log_2 t \rceil$ rails.

Fig. 4 Segment index encoder. (a) Segment index function. (b) LUT cascade.

[18] shows that by using an LUT cascade, we can generate compact NFGs
for a wide range of functions. In this paper, we significantly reduce
memory sizes of the coefficients table and the LUT cascade by reducing
the number of segments using the quadratic approximation. Our
synthesis system uses heterogeneous MDDs (Multi-valued Decision
Diagrams) [13], [14] to find compact LUT cascades. An LUT cascade
passes data from the leftmost LUT to the rightmost LUT. Since each LUT
in an LUT cascade operates independently and concurrently, we can
easily obtain a pipeline structure by assigning each LUT to one
pipeline stage. By using LUTs of the same size, we can easily achieve
efficient pipeline processing. To the best of our knowledge, this is
the first method for realizing any seg_func(x).
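As a behavioural reference only, the segment index function can be modelled in software with a binary search over the segment start points. This sketch does not build an LUT cascade; it only shows what seg_func(x) returns and evaluates the rail bound of Theorem 1. The names `make_seg_func` and `rail_bound` are hypothetical.

```python
from bisect import bisect_right


def make_seg_func(starts):
    """starts: sorted segment start points s_0 < s_1 < ... < s_(t-1)."""
    def seg_func(x):
        # index of the last segment whose start point is <= x
        return bisect_right(starts, x) - 1
    return seg_func


def rail_bound(t: int) -> int:
    """ceil(log2(t)) for t >= 1, computed with integer arithmetic."""
    return (t - 1).bit_length()


seg_func = make_seg_func([1.0, 1.25, 1.5])
print(seg_func(1.0), seg_func(1.3), seg_func(1.99), rail_bound(11))
```

A hardware encoder must reproduce exactly this mapping for every n-bit input; the theorem bounds the number of rails needed to do so, not the search time of this software model.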

4.1 Reduction of the Size of the Multiplier

To generate a fast NFG, reducing multiplier size is important. Since
the size of multipliers depends on the number of bits for $c_{2i}$,
$c'_{1i}$, and $x - q_i$, we reduce the number of bits to represent
these values.

To reduce the number of bits for $c_{2i}$ and $c'_{1i}$, we use a
scaling method [9]. We represent $c_{2i}$ and $c'_{1i}$ as
$c_{2i} \times 2^{-l_{2i}} \times 2^{l_{2i}}$ and
$c'_{1i} \times 2^{-l_{1i}} \times 2^{l_{1i}}$, respectively. We then
compute the products $c_{2i}(x - q_i)^2$ and $c'_{1i}(x - q_i)$ using
multipliers for the right-shifted multiplicands,
$c_{2i} \times 2^{-l_{2i}} (x - q_i)^2$ and
$c'_{1i} \times 2^{-l_{1i}} (x - q_i)$, and left shifters to restore
the products. Applying right shifts reduces the number of bits for
$c_{2i} \times 2^{-l_{2i}}$ and $c'_{1i} \times 2^{-l_{1i}}$ (i.e.,
the multiplier sizes) by rounding the $l_{2i}$ LSBs of $c_{2i}$ and
the $l_{1i}$ LSBs of $c'_{1i}$, while increasing the rounding errors.
In the Appendix, we find the largest $l_{2i}$ and $l_{1i}$ for each
segment that preserve the given acceptable error. When $l_{2i}$ and
$l_{1i}$ are 0 for all the segments, we do not use the scaling method.

Example 1: Consider a 24-bit precision NFG for $\sqrt{-\ln(x)}$. By
using the scaling method, $c_{2i}$, $l_{2i}$, $c'_{1i}$, and $l_{1i}$
have 13 bits, 5 bits, 20 bits, and 4 bits, respectively, and they
produce an NFG with memory size: 173,056 bits, operating frequency:
130 MHz, and 16 DSP units. On the other hand, without the scaling
method, $c_{2i}$ and $c'_{1i}$ have 37 bits and 28 bits, respectively,
and they produce an NFG with larger memory size: 184,832 bits, much
lower operating frequency: 71 MHz, and more DSP units: 20 units.
(End of Example)
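The effect of the scaling method on one product can be sketched with integers. This is a simplified model, not the paper's datapath: the coefficient and the operand are plain integers standing for fixed-point values, the shift amount `l` is chosen by hand rather than by the Appendix procedure, and the name `scaled_product` is hypothetical.

```python
def scaled_product(c_int: int, x_int: int, l: int) -> int:
    """Multiply with a coefficient whose l LSBs were rounded off, then shift back."""
    if l == 0:
        return c_int * x_int                     # no scaling applied
    c_short = (c_int + (1 << (l - 1))) >> l      # round off the l LSBs: narrower multiplicand
    return (c_short * x_int) << l                # left shifter restores the weight


c, x, l = 733, 37, 3
approx = scaled_product(c, x, l)
exact = c * x
print(approx, exact, abs(approx - exact))
```

In this model the shortened coefficient differs from the original by at most $2^{l-1}$, so the product error is at most $2^{l-1}$ times the magnitude of the operand. That trade-off between multiplier width and rounding error is what the choice of the shift amounts has to balance.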

Next, we reduce the value of $x - q_i$. The number of bits for
$x - q_i$ influences the sizes of the squaring unit and multipliers.
Thus, reducing the value of $x - q_i$ reduces the sizes of the
squaring unit and multipliers, and also the error. From (4), we can
choose any value for $q_i$. To reduce the value of $x - q_i$, for a
segment $[s_i, e_i]$, we set $q_i = (s_i + e_i)/2$. Then, we have
$|x - q_i| \le (e_i - s_i)/2$. Thus, reducing the segment width
$e_i - s_i$ reduces the value of $x - q_i$. However, this also
increases the number of segments, and results in increased memory
size. We show a method that reduces the segment width without
increasing the memory size.

The coefficients table in Fig. 3 has $2^k$ words, where
$k = \lceil \log_2 t \rceil$ and t is the number of segments.
Therefore, we can increase the number of segments up to $t = 2^k$
without increasing the memory size. From Theorem 1, the size of the
LUT cascade also depends on the value of k. Thus, increasing the
number of segments to $t = 2^k$ rarely increases the size of the LUT
cascade. We reduce the size of segments by dividing the largest
segment into two equal sized segments up to $t = 2^k$. This reduces
the rounding error as well (see Appendix).
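The splitting step can be sketched as a short loop. The sketch assumes segments are given as (start, end) pairs and ignores whether the midpoints fall on the n-bit input grid, which a real implementation would have to respect; `split_to_power_of_two` and `centers` are hypothetical names.

```python
from fractions import Fraction


def split_to_power_of_two(segments):
    """Halve the widest segment repeatedly until the count reaches 2^k, k = ceil(log2 t)."""
    segs = list(segments)
    k = (len(segs) - 1).bit_length()
    while len(segs) < 2 ** k:
        j = max(range(len(segs)), key=lambda idx: segs[idx][1] - segs[idx][0])
        s, e = segs[j]
        mid = (s + e) / 2
        segs[j:j + 1] = [(s, mid), (mid, e)]     # replace one segment by its two halves
    return segs


def centers(segs):
    """q_i = (s_i + e_i) / 2 for every segment."""
    return [(s + e) / 2 for s, e in segs]


start = [(Fraction(0), Fraction(1, 2)), (Fraction(1, 2), Fraction(3, 4)), (Fraction(3, 4), Fraction(1))]
result = split_to_power_of_two(start)
print(result, centers(result))
```

Because the table already has $2^k$ words, the extra segments fill otherwise unused entries while halving the largest value of $|x - q_i|$ in the split segment.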

5.  Experimental  Results 

5.1  Number  of  Segments  and  Computation  Time 

Table 1 compares the number of segments for various approximation
methods for the functions in [17]. In this table, Entropy, Sigmoid,
and Gaussian are

\[ \mathrm{Entropy} = -x \log_2 x - (1 - x) \log_2 (1 - x), \]

\[ \mathrm{Sigmoid} = \frac{1}{1 + e^{-4x}}, \quad \text{and} \quad \mathrm{Gaussian} = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]

In Table 1, the columns "Linear Uniform" and "Linear Non" show the
number of uniform and non-uniform segments [18] for linear
approximation, respectively. The columns "2nd-Chebyshev Uniform" and
"2nd-Chebyshev Non" show the number of uniform and non-uniform
segments for the 2nd-order Chebyshev approximation, respectively. The
columns "Time" show the CPU time for our non-uniform segmentation
algorithm applied to the functions, in milliseconds.

Table 1 shows that, for many functions, the 2nd-order Chebyshev
approximations require many fewer segments than the linear
approximations. However, for some functions, such as
$\sqrt{-\ln(x)}$, the 2nd-order Chebyshev approximation based on
uniform segmentation requires many more segments than the linear and
2nd-order Chebyshev approximations based on non-uniform segmentations.
Many existing polynomial approximation methods are based on uniform
segmentation. For trigonometric and exponential functions, the methods
based on uniform segmentation require relatively few segments.
However, for some kinds of functions such as $\sqrt{-\ln(x)}$, the
uniform methods require excessively many segments, even if quadratic
approximation is used. On the other hand, our quadratic approximation
based on non-uniform segmentation requires many fewer


Table 1 Number of segments for various approximation methods.

Columns 3-7: x has 15-bit accuracy.
Columns 8-12: x has 23-bit accuracy (acceptable approximation error: 2^-25).

Function f(x)  | Domain           |  Lin-U |  Lin-N | Cheb-U | Cheb-N |   Time |     Lin-U |  Lin-N |    Cheb-U | Cheb-N |   Time
---------------+------------------+--------+--------+--------+--------+--------+-----------+--------+-----------+--------+-------
2^x            | [0, 1]           |    129 |    128 |      9 |      7 |     10 |     2,049 |        |        65 |     44 |
1/x            | [1, 2)           |    128 |    124 |     16 |     11 |     10 |     2,048 |  1,982 |       128 |     64 |
√x             | [1/32, 2)        |  2,016 |    193 |    252 |     24 |     10 |    32,256 |        |     2,016 |    138 |
1/√x           | [1, 2)           |    128 |     46 |     16 |      8 |     10 |     2,048 |        |       128 |     46 |     70
log2(x)        | [1, 2)           |    128 |    128 |     16 |     10 |     10 |     2,048 |        |       128 |     56 |     90
ln(x)          | [1, 2)           |    128 |     89 |     16 |      9 |     10 |     2,048 |  1,437 |       128 |     50 |     80
sin(πx)        | [0, 1/2)         |    257 |    127 |     17 |     12 |     10 |     4,097 |        |       129 |     74 |     70
cos(πx)        | [0, 1/2)         |    257 |    127 |     17 |     12 |     10 |     4,097 |        |       129 |     74 |     80
tan(πx)        | [0, 1/4)         |    257 |    112 |     33 |     12 |     10 |     4,097 |  1,787 |       129 |     73 |    140
√(-ln(x))      | [1/32, 1)        | 31,744 |    354 | 31,744 |     52 |     90 | 8,126,464 |  5,933 | 8,126,464 |    331 |    840
tan^2(πx) + 1  | [0, 1/4)         |    513 |    256 |     33 |     17 |     20 |     8,193 |        |       257 |    101 |    200
Entropy        | [1/256, 255/256] |  2,033 |    520 |    509 |        |     30 |    32,513 |        |     4,065 |    234 |    260
Sigmoid        | [0, 1]           |    129 |    127 |     33 |     13 |     20 |           |        |       129 |     76 |    220
Gaussian       | [0, 1/2]         |     33 |     32 |      5 |      4 |     10 |       513 |    512 |        33 |     18 |     30
Average        |                  |  2,706 |    170 |  2,337 |     17 |     19 |   587,466 |  2,739 |   580,995 |     99 |    171

Lin: Linear approximation. Cheb: 2nd-order Chebyshev approximation.
U: Uniform segmentation. N: Non-uniform segmentation.
Time: CPU time [msec] for our non-uniform segmentation algorithm conducted on the following environment.
System: Sun Blade 2500 (Silver), CPU: UltraSPARC-IIIi 1.6 GHz, Memory: 6 GB, OS: Solaris 9, and C compiler: gcc -O2.
Table 2 Comparison with linear approximation based on non-uniform
segmentation.

Columns 2-4: 16-bit precision (15-bit accuracy).
Columns 5-7: 24-bit precision (23-bit accuracy).

Function f(x)  | Linear [bits] | Quad. [bits] | R [%] | Linear [bits] | Quad. [bits] | R [%]
---------------+---------------+--------------+-------+---------------+--------------+------
2^x            |        20,992 |        1,112 |     5 |       696,320 |       19,072 |     3
1/x            |        21,248 |        2,432 |    11 |       700,416 |       19,136 |     3
√x             |        43,776 |        5,536 |    13 |     1,425,408 |       86,784 |     6
1/√x           |        10,176 |        1,104 |    11 |       343,040 |       19,008 |     6
log2(x)        |        20,864 |        2,464 |    12 |       694,272 |       19,072 |     3
ln(x)          |        20,096 |        2,448 |    12 |       700,416 |       19,136 |     3
sin(πx)        |        19,456 |        2,336 |    12 |       661,504 |       38,656 |     6
cos(πx)        |        19,584 |        2,336 |    12 |       663,552 |       38,784 |     6
tan(πx)        |        19,712 |        2,304 |    12 |       667,648 |       38,272 |     6
√(-ln(x))      |        74,240 |       11,264 |    15 |     2,662,400 |      173,056 |     7
tan^2(πx) + 1  |        37,632 |        4,960 |    13 |     1,290,240 |       39,040 |     3
Entropy        |       106,496 |       10,688 |    10 |     3,768,320 |       83,968 |     2
Sigmoid        |        21,120 |        2,432 |    12 |       702,464 |       40,320 |     6
Gaussian       |         4,416 |          444 |    10 |       156,672 |        8,384 |     5
Average        |        31,415 |        3,704 |    11 |     1,080,905 |       45,906 |     4

Linear: Memory size for linear approximation [18].
Quad.: Memory size for 2nd-order Chebyshev approximation. R: Ratio.


segments for a wide range of functions. Also, Table 1 shows that the
CPU time to compute the segmentation strongly depends on the number of
segments. Smaller acceptable approximation error requires more
segments and longer computation time. However, Table 1 shows that, for
all functions in the table, the CPU times are shorter than 1 second
when the acceptable approximation error is $2^{-25}$.

These results show that, for various functions, our segmentation
algorithm partitions a domain into fewer non-uniform segments quickly,
and it is useful for fast synthesis.

5.2  Memory  Sizes  of  Various  NFGs 

This section compares the memory sizes of our NFGs with three existing
NFGs [3], [4], [18]. Table 2 compares them with the NFGs based on
linear approximation and non-uniform segmentation shown in [18]. In
Table 2, the columns "R" show the following values:

\[ R = \frac{\text{memory size of quadratic approximation}}{\text{memory size of linear approximation}} \times 100\ (\%). \]

Table 2 shows that NFGs using quadratic approximation require much
smaller memory than ones using linear approximation. In particular,
24-bit precision NFGs using quadratic approximation can be implemented
with, on average, only 4% of the memory size needed for a linear
approximation. From the relation between precision and memory size
shown in Table 2, we can see that increasing the precision decreases
the ratio of memory sizes in NFGs.
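As a worked instance of the ratio R, take the 1/x row of Table 2. The rounding to a whole percentage is an assumption about how the table entries were produced.

```latex
\[
  R_{1/x,\ 16\text{-bit}} = \frac{2{,}432}{21{,}248} \times 100 \approx 11.4 \;\to\; 11\%,
  \qquad
  R_{1/x,\ 24\text{-bit}} = \frac{19{,}136}{700{,}416} \times 100 \approx 2.7 \;\to\; 3\%.
\]
```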

Figure 5 shows a plot of the total memory size and the required
precision for NFGs based on linear and quadratic approximations of
1/x. Since the memory size of the quadratic NFG increases more slowly
than that of the linear NFG as the precision increases, we conjecture
that for precisions higher than 24 bits, our quadratic NFGs will
require reasonable memory size. Unfortunately, we could not verify
this because of the precision of our NFG synthesis tool.

Table 3 Comparison with 5th-order approximation based on uniform
segmentation.

Func. f(x) | Domain   | Accuracy      | 5th-order (Uniform) [bits] | Quad. (Non) [bits] | Ratio [%]
-----------+----------+---------------+----------------------------+--------------------+----------
sin(πx)    | [0, 1/4] | (illegible)   |                     70,528 |             18,048 |        26
exp(x)     | [0, 1]   | 2^-24         |                     82,432 |             43,136 |        52
2^x - 1    | [0, 1]   | 2^-24         |                     89,600 |             19,968 |        22

5th-order: 5th-order approximation [3].
Quad.: 2nd-order Chebyshev approximation.

Table 4 Comparison with quadratic approximation based on uniform
segmentation.

Func. f(x) | Domain | Accuracy | Minimax (Uniform) [bits] | Cheb. (Non) [bits] | Ratio [%]
-----------+--------+----------+--------------------------+--------------------+----------
sin(πx/4)  | [0, 1) | 2^-24    |                   16,288 |             19,200 |       118
2^x - 1    | [0, 1) | 2^-16    |                    2,208 |              2,512 |       114

Minimax: 2nd-order minimax approximation [4].
Cheb.: 2nd-order Chebyshev approximation.

Table 3 and Table 4 compare our NFGs with NFGs using a 5th-order
Taylor expansion [3] and NFGs using 2nd-order minimax approximation by
the Remez algorithm [4], respectively. Both approximations in [3], [4]
are based on uniform segmentation. Thus, their NFGs require no segment
index encoder. On the other hand, since our approximation is based on
non-uniform segmentation, the memory size is obtained by summing the
memory sizes of the coefficients table and the segment index encoder.
As shown in Table 1, for trigonometric and exponential functions, the
difference between the number of uniform segments and non-uniform
segments is not so large under the same approximation polynomial. For
such functions, NFGs based on uniform segmentation (needing no segment
index encoder) often require smaller memory than non-uniform
segmentations. Although our NFGs require the segment index encoder and
use approximation polynomials with larger approximation error than
approximation polynomials in [3], [4], our NFGs for such functions are
implemented with only 22% to 52% of the memory sizes of NFGs in [3],
and with memory size comparable to [4]. In [3], [4], memory sizes of
NFGs for $\sqrt{x}$ and $\sqrt{-\ln(x)}$ are unavailable. However,
from Table 1, we can see that the memory size of their NFGs for
$\sqrt{x}$ and $\sqrt{-\ln(x)}$ is excessively large. On the other
hand, our NFGs can realize a wide range of functions with small memory
Table 5 FPGA implementation of NFGs for linear and quadratic approximations.

FPGA device: Altera Stratix (EP1S10F484C5: 10,570 logic elements, 48 DSP units)
Logic synthesis tool: Altera QuartusII 5.0 with speed optimization option and timing requirement of 200 MHz

Columns 2-7: 16-bit precision. Columns 8-13: 24-bit precision.
LE: Logic elements. DSP: DSP units. Freq.: Operating frequency [MHz].

Function f(x)  | LE Lin. | LE Quad. | DSP Lin. | DSP Quad. | Freq. Lin. | Freq. Quad. | LE Lin. | LE Quad. | DSP Lin. | DSP Quad. | Freq. Lin. | Freq. Quad.
---------------+---------+----------+----------+-----------+------------+-------------+---------+----------+----------+-----------+------------+------------
2^x            |     167 |      482 |        2 |         4 |        195 |         185 |     604 |      758 |        2 |        10 |          - |         131
1/x            |     204 |      376 |        2 |         4 |        234 |         186 |     636 |      859 |        2 |        10 |          - |         134
√x             |     270 |      496 |        2 |         4 |        237 |         179 |   1,211 |      822 |        2 |        16 |          - |         124
1/√x           |     186 |      475 |        2 |         4 |        237 |         186 |     402 |      753 |        2 |        10 |          - |         131
log2(x)        |     163 |      381 |        2 |         4 |        194 |         186 |     597 |      757 |        2 |        10 |          - |         131
ln(x)          |     170 |      379 |        2 |         4 |        197 |         185 |     416 |      863 |        2 |        10 |          - |         131
sin(πx)        |     154 |      424 |        2 |         4 |        197 |         192 |     480 |      646 |        8 |        10 |          - |         134
cos(πx)        |     172 |      354 |        2 |         4 |        237 |         179 |     412 |      647 |        8 |        10 |          - |         131
tan(πx)        |     234 |      382 |        2 |         4 |        237 |         178 |     655 |      604 |        2 |        10 |          - |         131
√(-ln(x))      |     304 |      623 |        2 |        10 |        215 |         135 |     854 |      942 |        8 |        16 |          - |         130
tan^2(πx) + 1  |     132 |      282 |        2 |         4 |        194 |         215 |     991 |      720 |        2 |        10 |          - |         135
Entropy        |     141 |      403 |        2 |         4 |        235 |         206 |   1,370 |      914 |        2 |        16 |          - |         128
Sigmoid        |     167 |      430 |        2 |         4 |        194 |         191 |     627 |      706 |        2 |        10 |          - |         131
Gaussian       |     181 |      419 |        2 |         4 |        237 |         186 |     303 |      747 |        2 |        10 |        216 |         129
Average        |     189 |      422 |        2 |         4 |        217 |         185 |     683 |      767 |        3 |        11 |          - |         131

Lin.: Linear approximation [18]. Quad.: 2nd-order Chebyshev approximation.
-: NFGs cannot be mapped into the FPGA due to the excessive memory size.


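Table 6 Comparison with table-based method: STAM [22].

FPGA device: Altera Stratix (EP1S25F780C5: 25,660 logic elements, 80 DSP units)

Function f(x) | Domain | In prec. Int | In prec. Frac | Out prec. Int | Out prec. Frac | Memory STAM [bits] | Memory Quad. [bits] | Ratio [%] | Freq. STAM [MHz] | Freq. Quad. [MHz] | Ratio [%]
--------------+--------+--------------+---------------+---------------+----------------+--------------------+---------------------+-----------+------------------+-------------------+----------
1/x           | [1, 2) |            1 |            23 |             1 |             23 |            651,264 |              19,136 |         3 |              184 |               131 |        71
sin(x)        | [0, 1) |            0 |            24 |             0 |             24 |            491,520 |              40,832 |         8 |              205 |               131 |        64
2^x           | [0, 1) |            0 |            24 |             1 |             23 |            356,352 |              19,136 |         5 |              198 |               132 |        66

Quad.: our NFG. In prec.: Input precision. Out prec.: Output precision.
Int: Number of integer bits. Frac: Number of fractional bits.
Freq.: Operating frequency.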


size. 

5.3  FPGA  Implementation 

This  section  compares  the  FPGA  implementation  results  of 
our  NFGs  with  two  existing  NFGs  [18],  [22].  Table  5  shows 
FPGA  implementation  results  of  our  NFGs  and  NFGs  based 
on  linear  approximation  using  non-uniform  segmentation 
[18]. 

Since the architecture of a linear NFG [18] is simpler than
that of a quadratic NFG, linear NFGs are faster and require
fewer logic elements and DSP units than quadratic NFGs.
However, linear approximations require more segments and
larger memory than quadratic approximations, as shown in
Table 1 and Table 2. Table 5 shows that, for every function
except Gaussian, the smallest device in the Stratix family
does not have enough resources to achieve 24-bit precision
using linear approximation. This is due to the excessive
memory size; many logic elements and DSP units remain
unused. The most crucial issue in FPGA implementation is
thus the constraint on hardware resources. For 24-bit
precision, the linear approximation requires a larger FPGA
because of its memory size, and in a larger FPGA even more
logic elements and DSP units are left unused. On the other
hand, a quadratic NFG can be implemented with a smaller
FPGA, since it requires much less memory than a linear NFG.
In fact, we implemented 24-bit precision NFGs with
lower-cost, smaller FPGAs (Cyclone II) by using quadratic
approximation instead of linear approximation.


To show the performance of our NFGs, we compare them with
a well-known table-based NFG, STAM [22], which uses
1st-order Taylor expansion and uniform segmentation.
Table 6 compares memory size and operating frequency. Since
the STAM requires tables and a multiple-input adder but no
multiplier, it is faster. However, it requires excessive
memory size and therefore larger FPGAs. On the other hand,
our NFGs achieve about 60% to 70% of the operating
frequency of the STAM while using only 3% to 8% of its
memory (Table 6).
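The ratio columns of Table 6 follow directly from the raw figures. As a quick check, the following Python sketch recomputes them; the dictionary layout and the function name `ratios` are illustrative choices and are not part of the synthesis system described in this paper.

```python
# Figures copied from Table 6:
# (STAM memory [bits], Quad. memory [bits], STAM freq. [MHz], Quad. freq. [MHz])
table6 = {
    "1/x":    (651_264, 19_136, 184, 131),
    "sin(x)": (491_520, 40_832, 205, 131),
    "2^x":    (356_352, 19_136, 198, 132),
}

def ratios(stam_mem, quad_mem, stam_freq, quad_freq):
    """Return (memory ratio, frequency ratio) of Quad. to STAM, in percent."""
    return 100.0 * quad_mem / stam_mem, 100.0 * quad_freq / stam_freq

for name, row in table6.items():
    mem_pct, freq_pct = ratios(*row)
    print(f"{name:7s} memory {mem_pct:4.1f}%  frequency {freq_pct:4.1f}%")
```

The unrounded memory ratios lie between roughly 3% and 8%, and the frequency ratios between roughly 64% and 71%, in line with the rounded entries of the table.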

6.  Conclusion  and  Comments 

We have demonstrated an architecture and a synthesis
method for compact NFGs for trigonometric, logarithmic,
square root, reciprocal, and combinations of these functions.
Our architecture can efficiently realize any non-uniform
segmentation using a compact LUT cascade, and can
approximate many numerical functions by quadratic
polynomials. Therefore, our architecture is suitable for
automatic synthesis of fast and compact NFGs. Experimental
results show that our synthesis method can approximate a
wide range of functions with a small number of non-uniform
segments, and generate NFGs with small memory size. For
24-bit precision, our NFGs can be implemented with, on
average, only 4% of the memory size of NFGs based on linear
approximation and non-uniform segmentation, and with, on
average, only 33% of the memory size of NFGs based on the
5th-order approximation and uniform segmentation. NFGs
based on the linear approximation are faster than the
quadratic ones, but for high precision, they require a large
FPGA due to the excessive memory size. On the other hand,
our quadratic NFGs can achieve about 70% of the throughput
of the table-based NFG using only a few percent of the
memory, and can be implemented with a more compact and
low-cost FPGA.

Our 24-bit precision NFGs can be used to compute the
significand of an IEEE single-precision floating-point
number. Since our NFGs are compact, we conjecture that they
will also be practical for precisions higher than 24 bits.
However, we could not verify the accuracy of NFGs with
precision higher than 24 bits because of errors in our NFG
synthesis tool, which is written in C. Thus, we have to
develop a more accurate NFG synthesis tool and a
verification tool in the future.
Appendix: Error Analysis

We analyze the error for our NFGs and show a method to
obtain the appropriate bit sizes for the units in our
architecture. Our synthesis system applies the method in [5]
to the NFG shown in Fig. 3.

A.1  Error of NFG

There are two kinds of errors: approximation error and
rounding error. The approximation error is given by $e_a$, as
shown in Sect. 3. Thus, this section focuses on rounding
error. We ignore the integer bits in this analysis because
rounding errors are independent of integer bits. We assume
there is at least one fractional bit.

Errors in Coefficients: The coefficients $c_{2i}$, $c'_{1i}$, and
$c'_{0i}$ for the quadratic approximation $g_i(x)$ are rounded to
$u_2$-bit, $u_1$-bit, and $u_0$-bit accuracy, respectively. That is,
due to rounding, the coefficients become

\[ c_{2i} + \alpha_2, \quad c'_{1i} + \alpha_1, \quad \text{and} \quad c'_{0i} + \alpha_0 \]

\[ (|\alpha_2| \le 2^{-(u_2+1)}, \quad |\alpha_1| \le 2^{-(u_1+1)}, \quad |\alpha_0| \le 2^{-(u_0+1)}), \]

where $\alpha_2$, $\alpha_1$, and $\alpha_0$ are the rounding errors of
$c_{2i}$, $c'_{1i}$, and $c'_{0i}$, respectively. In this case, we need
no hardware for rounding the coefficients because the
rounding is precomputed before the coefficients are stored in
the coefficients table. Since rounding yields a more accurate
result than truncation, we choose rounding.
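The rounding bound for a stored coefficient can be checked with exact rational arithmetic. The sketch below is only an illustration: the function name `round_coefficient`, the round-half-up rule, and the sample coefficient are assumptions, not details taken from the synthesis tool.

```python
from fractions import Fraction
from math import floor

def round_coefficient(c, u):
    """Round c to u fractional bits (round half up).

    Returns (stored value, alpha), where alpha = stored value - c.
    """
    scale = 2 ** u
    stored = Fraction(floor(Fraction(c) * scale + Fraction(1, 2)), scale)
    return stored, stored - Fraction(c)

# Hypothetical coefficient and bit size, for illustration only.
stored, alpha = round_coefficient(Fraction(1, 3), u=8)
print(stored, alpha)
```

For this sample, the stored value is 85/256 and the rounding error is -1/768, whose magnitude is below the bound $2^{-(8+1)} = 1/512$.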

Error in Squaring Unit: When input $x$ has $m_{in}$ bits, our
synthesis system ensures that $q_i$ also has $m_{in}$ bits.
Therefore, the value of $(x - q_i)$ also has $m_{in}$ bits, and has
no rounding error. Although the value of $(x - q_i)^2$ has
$2m_{in}$ bits, it is truncated to $u_3$ bits in order to reduce the
size of the succeeding multiplier, where $u_3 < 2m_{in}$. We choose
truncation because it does not require additional hardware,
while rounding requires half adders at the output of the
squaring unit. Thus, the succeeding multiplier uses the value of

\[ (x - q_i)^2 - \alpha_3 \quad (0 \le \alpha_3 \le 2^{-u_3} - 2^{-2m_{in}}), \]

where $\alpha_3$ is the rounding error for the truncation.
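The truncation bound of the squaring unit can be confirmed exhaustively for small word lengths. The following sketch assumes that $(x - q_i)$ is non-negative and is held as an integer `d_int` scaled by $2^{m_{in}}$; the function name and the chosen bit sizes are hypothetical.

```python
from fractions import Fraction

def truncated_square(d_int, m_in, u3):
    """Square the fixed-point value d_int / 2**m_in and keep u3 fractional bits.

    Returns (truncated value, alpha3) as exact Fractions, where
    alpha3 = exact square - truncated value.
    """
    assert d_int >= 0 and 0 < u3 < 2 * m_in
    exact = Fraction(d_int * d_int, 2 ** (2 * m_in))
    kept = (d_int * d_int) >> (2 * m_in - u3)   # discard the low-order bits
    truncated = Fraction(kept, 2 ** u3)
    return truncated, exact - truncated

# Hypothetical sizes: 8 input fractional bits, 10 bits kept after squaring.
m_in, u3 = 8, 10
bound = Fraction(1, 2 ** u3) - Fraction(1, 2 ** (2 * m_in))
errors = [truncated_square(d, m_in, u3)[1] for d in range(2 ** m_in)]
print(min(errors) >= 0, max(errors) <= bound)
```

Because truncation only discards the low-order $2m_{in} - u_3$ bits, the error is never negative and never exceeds $2^{-u_3} - 2^{-2m_{in}}$.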

Errors in Multipliers: As shown in the two previous
paragraphs, $c_{2i}$ and $c'_{1i}$ are stored as $c_{2i} + \alpha_2$ and
$c'_{1i} + \alpha_1$ in the ROM, and $(x - q_i)^2$ is changed to
$(x - q_i)^2 - \alpha_3$ by truncation. These values change the
original products $c_{2i}(x - q_i)^2$ and $c'_{1i}(x - q_i)$ to

\[ (c_{2i} + \alpha_2)\{(x - q_i)^2 - \alpha_3\} \quad \text{and} \tag{A-1} \]

\[ (c'_{1i} + \alpha_1)(x - q_i). \tag{A-2} \]

The values of (A-1) and (A-2) have $(u_2 + u_3)$ bits and
$(u_1 + m_{in})$ bits, respectively, and they are truncated to $u_0$
bits in order to match the addition with $c'_{0i}$. Note that
$u_0 \le \min(u_2 + u_3, u_1 + m_{in})$. Following a method shown
in [20], the $(u_0 + 1)$th fractional bit $d_{-(u_0+1)}$ of (A-1) is
set to 1 after the truncation. Thus, the adder uses the values of

\[ (c_{2i} + \alpha_2)\{(x - q_i)^2 - \alpha_3\} - \alpha_4 \quad \text{and} \quad (c'_{1i} + \alpha_1)(x - q_i) - \alpha_5, \]

where $\alpha_4$ and $\alpha_5$ are the rounding errors for the
truncations of (A-1) and (A-2):
$0 \le \alpha_4 \le 2^{-(u_0+1)} - 2^{-(u_2+u_3)}$ and
$0 \le \alpha_5 \le 2^{-u_0} - 2^{-(u_1+m_{in})}$.

Errors in Adder: From the previous paragraphs, the original
sum $c_{2i}(x - q_i)^2 + c'_{1i}(x - q_i) + c'_{0i}$ is changed to

\[ (c_{2i} + \alpha_2)\{(x - q_i)^2 - \alpha_3\} + (c'_{1i} + \alpha_1)(x - q_i) + c'_{0i} + \alpha_0 - \alpha_4 - \alpha_5. \tag{A-3} \]

This value has $u_0$ bits, and is rounded to $m_{out}$ bits (output
accuracy given in the specification), where $m_{out} < u_0$. Thus,
the value of $g_i(x)$ is changed to

\[ (c_{2i} + \alpha_2)\{(x - q_i)^2 - \alpha_3\} + (c'_{1i} + \alpha_1)(x - q_i) + c'_{0i} + \alpha_0 - \alpha_4 - \alpha_5 + \alpha_6, \]

where $\alpha_6$ is the error for the rounding to $m_{out}$ bits. Since
$d_{-(u_0+1)}$ of (A-1) is set to 1, $d_{-(u_0+1)}$ of (A-3) is also 1,
and we have $|\alpha_6| \le 2^{-(m_{out}+1)} - 2^{-(u_0+1)}$. Note that
$d_{-(u_0+1)}$ is not implemented in hardware [20]. By expanding
and rearranging this, we have

\[ g_i(x) + \alpha_2\{(x - q_i)^2 - \alpha_3\} + \alpha_1(x - q_i) + \alpha_0 - c_{2i}\alpha_3 - \alpha_4 - \alpha_5 + \alpha_6. \tag{A-4} \]

This is the output value of the NFG including rounding
error. (A-4) has the maximum rounding error when
$\alpha_0 = -2^{-(u_0+1)}$, $\alpha_1 = -2^{-(u_1+1)}$,
$\alpha_2 = -2^{-(u_2+1)}$, $\alpha_3 = 2^{-u_3} - 2^{-2m_{in}}$,
$\alpha_4 = 2^{-(u_0+1)} - 2^{-(u_2+u_3)}$,
$\alpha_5 = 2^{-u_0} - 2^{-(u_1+m_{in})}$, and
$\alpha_6 = -(2^{-(m_{out}+1)} - 2^{-(u_0+1)})$, where we assume that
the values of $(x - q_i)$ and $c_{2i}$ are positive. Therefore, the
maximum rounding error $e_r$ is

\[ e_r = 2^{-(u_2+1)}(\mathit{max\_seg}^2 - 3 \cdot 2^{-u_3} + 2^{-2m_{in}}) + 2^{-(u_1+1)}\,\mathit{max\_seg} + 3 \cdot 2^{-(u_0+1)} + \mathit{max\_c2}\,(2^{-u_3} - 2^{-2m_{in}}) - 2^{-(u_1+m_{in})} + 2^{-(m_{out}+1)}, \]

where $\mathit{max\_seg}$ and $\mathit{max\_c2}$ are the maximum values of
$|x - q_i|$ and $|c_{2i}|$, respectively.
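As a consistency check on this expression, the closed form for the maximum rounding error can be compared with the sum of the worst-case magnitudes of the individual terms in (A-4). The sketch below uses exact fractions; the function names and the sample bit sizes are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def p2(k):
    """Return 2**(-k) exactly, for a non-negative integer k."""
    return Fraction(1, 2 ** k)

def e_r_closed_form(u0, u1, u2, u3, m_in, m_out, max_seg, max_c2):
    """Maximum rounding error, written as in the closed-form expression."""
    return (p2(u2 + 1) * (max_seg ** 2 - 3 * p2(u3) + p2(2 * m_in))
            + p2(u1 + 1) * max_seg
            + 3 * p2(u0 + 1)
            + max_c2 * (p2(u3) - p2(2 * m_in))
            - p2(u1 + m_in)
            + p2(m_out + 1))

def e_r_term_by_term(u0, u1, u2, u3, m_in, m_out, max_seg, max_c2):
    """Sum of the worst-case magnitudes of the error terms in (A-4)."""
    a0, a1, a2 = p2(u0 + 1), p2(u1 + 1), p2(u2 + 1)  # coefficient rounding
    a3 = p2(u3) - p2(2 * m_in)                       # squaring-unit truncation
    a4 = p2(u0 + 1) - p2(u2 + u3)                    # truncation of (A-1)
    a5 = p2(u0) - p2(u1 + m_in)                      # truncation of (A-2)
    a6 = p2(m_out + 1) - p2(u0 + 1)                  # final rounding
    return (a2 * (max_seg ** 2 - a3) + a1 * max_seg + a0
            + max_c2 * a3 + a4 + a5 + a6)

# Hypothetical bit sizes and segment data, chosen only for illustration.
params = dict(u0=26, u1=22, u2=18, u3=26, m_in=24, m_out=24,
              max_seg=Fraction(1, 16), max_c2=Fraction(1, 2))
assert e_r_closed_form(**params) == e_r_term_by_term(**params)
print(float(e_r_closed_form(**params)))
```

The two functions agree because the $u_0$ terms collapse to $3 \cdot 2^{-(u_0+1)}$ and the term $-2^{-(u_2+u_3)}$ equals $-2 \cdot 2^{-(u_2+1)} 2^{-u_3}$, which turns the coefficient of $2^{-u_3}$ inside the first parenthesis into 3.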

A.2  Calculation of Bit Sizes for Units

The number of bits for the integer part is calculated as
$[\log_2(\mathit{max\_value} + 1)] + 1$, where $\mathit{max\_value}$ is an
integer, and denotes the maximum absolute value of the range.

On the other hand, the number of bits for the fractional
part is calculated using the result of the error analysis.
From the error analysis, an NFG with an acceptable error $e$
is achieved when $e_a + e_r < e$, where $e_a$ and $e_r$ are the
maximum approximation error and the maximum rounding error,
respectively. To generate fast and compact NFGs, we find the
minimum $u_0$, $u_1$, $u_2$, and $u_3$ that satisfy this relation
using nonlinear programming [7] that minimizes
$u_0 + u_1 + u_2 + u_3$ with the constraint $e_a + e_r < e$.
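To make the optimization problem concrete, the sketch below finds bit sizes that minimize $u_0 + u_1 + u_2 + u_3$ under the error constraint. It uses a plain exhaustive search over a small range rather than the nonlinear programming of [7], so it illustrates the problem statement and not the method used by the synthesis system. The function names, the search limit `u_max`, and the sample specification are assumptions.

```python
from fractions import Fraction
from itertools import product

def p2(k):
    """Return 2**(-k) exactly, for a non-negative integer k."""
    return Fraction(1, 2 ** k)

def e_r(u0, u1, u2, u3, m_in, m_out, max_seg, max_c2):
    """Maximum rounding error from the error analysis of Sect. A.1."""
    return (p2(u2 + 1) * (max_seg ** 2 - 3 * p2(u3) + p2(2 * m_in))
            + p2(u1 + 1) * max_seg + 3 * p2(u0 + 1)
            + max_c2 * (p2(u3) - p2(2 * m_in))
            - p2(u1 + m_in) + p2(m_out + 1))

def feasible(u, e_a, eps, m_in, m_out, max_seg, max_c2):
    """Check the structural conditions and the constraint e_a + e_r < eps."""
    u0, u1, u2, u3 = u
    if min(u) < 1 or u3 >= 2 * m_in:
        return False
    if not (m_out < u0 <= min(u2 + u3, u1 + m_in)):
        return False
    return e_a + e_r(u0, u1, u2, u3, m_in, m_out, max_seg, max_c2) < eps

def min_bit_sizes(e_a, eps, m_in, m_out, max_seg, max_c2, u_max=16):
    """Exhaustively search (u0, u1, u2, u3) with the smallest sum."""
    candidates = (u for u in product(range(1, u_max + 1), repeat=4)
                  if feasible(u, e_a, eps, m_in, m_out, max_seg, max_c2))
    return min(candidates, key=sum, default=None)

# Hypothetical 8-bit specification; eps plays the role of the acceptable error.
spec = dict(e_a=p2(11), eps=p2(8), m_in=8, m_out=8,
            max_seg=Fraction(1, 8), max_c2=Fraction(1, 2))
best = min_bit_sizes(**spec)
print(best, sum(best))
```

Exhaustive search is practical only for short word lengths such as this example; for 24-bit precision the search space grows quickly, which is one reason to use a dedicated optimization method.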

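Calculation of Shift Bits $l_{2i}$ and $l_{1i}$: From (A-1) and
(A-2), the maximum errors in the multipliers $e_1$ and $e_2$ are

\[ e_1 = 2^{-(u_2+1)}(\mathit{max\_seg}^2 - 2^{-u_3} + 2^{-2m_{in}}) + \mathit{max\_c2}\,(2^{-u_3} - 2^{-2m_{in}}) \quad \text{and} \]

\[ e_2 = 2^{-(u_1+1)}\,\mathit{max\_seg}, \]

respectively. Since $\mathit{max\_seg}$ and $\mathit{max\_c2}$ are the maximum
values of $|x - q_i|$ and $|c_{2i}|$, respectively, there are
non-negative integers $l_{2i}$ and $l_{1i}$ that satisfy the
following relations for each segment:

\[ 2^{l_{2i}}\, 2^{-(u_2+1)}\{|x - q_i|^2 - 2^{-u_3} + 2^{-2m_{in}}\} + |c_{2i}|(2^{-u_3} - 2^{-2m_{in}}) \le e_1 \quad \text{and} \]

\[ 2^{l_{1i}}\, 2^{-(u_1+1)}\,|x - q_i| \le e_2. \]

We transform these into

\[ 2^{-(u_2 - l_{2i} + 1)}\{|x - q_i|^2 - 2^{-u_3} + 2^{-2m_{in}}\} + |c_{2i}|(2^{-u_3} - 2^{-2m_{in}}) \le e_1 \quad \text{and} \tag{A-5} \]

\[ 2^{-(u_1 - l_{1i} + 1)}\,|x - q_i| \le e_2. \tag{A-6} \]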

From (A-5) and (A-6), for each segment, coefficients $c_{2i}$ and
$c'_{1i}$ can be rounded to $(u_2 - l_{2i})$ and $(u_1 - l_{1i})$ bits
within an acceptable error, respectively. That is, $l_{2i}$ and
$l_{1i}$ can be used for the scaling method in Sect. 4. From
(A-5) and (A-6), we have

\[ l_{2i} \le \log_2\left(\frac{\mathit{nume.}}{\mathit{deno.}}\right) \quad \text{and} \quad l_{1i} \le \log_2\left(\frac{\mathit{max\_seg}}{|x - q_i|}\right), \]

where

\[ \mathit{nume.} = 2^{-(u_2+1)}(\mathit{max\_seg}^2 - 2^{-u_3} + 2^{-2m_{in}}) + (\mathit{max\_c2} - |c_{2i}|)(2^{-u_3} - 2^{-2m_{in}}) \quad \text{and} \]

\[ \mathit{deno.} = 2^{-(u_2+1)}\{|x - q_i|^2 - 2^{-u_3} + 2^{-2m_{in}}\}. \]
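The bound on the shift bits for the linear coefficient implied by (A-6) can be evaluated per segment without floating-point logarithms. The sketch below finds, for each segment, the largest non-negative integer $l$ with $2^{l}\,|x - q_i| \le \mathit{max\_seg}$, which is equivalent to $l \le \log_2(\mathit{max\_seg}/|x - q_i|)$. It assumes that each segment is summarized by its largest value of $|x - q_i|$, here called its extent; the function name and the sample extents are hypothetical.

```python
from fractions import Fraction

def shift_bits_l1(seg_extents):
    """Largest non-negative integer l per segment with 2**l * extent <= max_seg.

    seg_extents holds, for each segment, the maximum of |x - q_i| on it.
    """
    max_seg = max(seg_extents)
    shifts = []
    for extent in seg_extents:
        assert extent > 0
        l = 0
        while (2 ** (l + 1)) * extent <= max_seg:
            l += 1
        shifts.append(l)
    return shifts

# Hypothetical non-uniform segmentation with four segments.
extents = [Fraction(1, 4), Fraction(1, 8), Fraction(3, 64), Fraction(1, 64)]
print(shift_bits_l1(extents))  # prints [0, 1, 2, 4]
```

Narrow segments receive larger shifts, so their linear coefficients can be stored with fewer fractional bits; the widest segment always receives a shift of 0.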


And, we choose

\[ l_{1i} = \left\lfloor \log_2\left(\frac{\mathit{max\_seg}}{|x - q_i|}\right) \right\rfloor \]